LLM 25-Day Course - Day 17: Building a RAG Pipeline

RAG (Retrieval-Augmented Generation) is a key technique for addressing LLM limitations such as hallucination and lack of up-to-date information. Let’s build a pipeline from scratch that retrieves external documents and provides them as evidence for LLM responses.

How RAG Works

RAG consists of three stages: (1) indexing — split documents into chunks, embed them, and store them in a vector DB; (2) retrieval — when a user question arrives, fetch the most relevant chunks via vector similarity search; (3) generation — insert the retrieved chunks into the prompt as context and have the LLM produce a grounded answer.
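Before reaching for any libraries, the whole flow fits in a few lines. In the sketch below, word overlap stands in for real embeddings, and every name and document is illustrative:

```python
# A minimal, library-free sketch of the three RAG stages.
# Word overlap stands in for real embedding similarity.

def tokenize(text):
    """Toy 'embedding': the set of lowercase words in the text."""
    return set(text.lower().replace(".", "").replace("?", "").split())

# Stage 1: index the document chunks
chunks = [
    "The refund policy allows returns within 14 days of purchase.",
    "Customer support is available on weekdays from 9am to 6pm.",
    "Standard shipping takes 3 to 5 business days.",
]
index = [(chunk, tokenize(chunk)) for chunk in chunks]

# Stage 2: retrieve the chunk most relevant to the question
question = "What is the refund policy?"
q_tokens = tokenize(question)
best_chunk, _ = max(index, key=lambda item: len(q_tokens & item[1]))

# Stage 3: stuff the retrieved chunk into the prompt given to the LLM
prompt = f"Answer using this context:\n{best_chunk}\n\nQuestion: {question}"
print(best_chunk)
```

A real pipeline replaces `tokenize` with an embedding model and the `max` call with a vector-DB search, but the control flow is exactly this.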

Document Loading and Splitting

# pip install langchain langchain-community langchain-openai chromadb

from langchain_community.document_loaders import TextLoader, PyPDFLoader  # PyPDFLoader for PDF sources
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load text file
loader = TextLoader("company_docs.txt", encoding="utf-8")
documents = loader.load()

# Split documents into appropriate sizes
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # Maximum characters per chunk
    chunk_overlap=50,     # Overlapping characters between chunks (maintains context)
    separators=["\n\n", "\n", ".", " "],  # Split priority
)
chunks = text_splitter.split_documents(documents)
print(f"Original documents: {len(documents)} -> Split chunks: {len(chunks)}")
print(f"First chunk:\n{chunks[0].page_content[:200]}")

Embedding and Vector Store Creation

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Configure embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Store vectors in ChromaDB (persistent storage on disk)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="company_docs",
)

# Test similarity search
query = "What is the company's refund policy?"
results = vectorstore.similarity_search(query, k=3)
for i, doc in enumerate(results):
    print(f"\n--- Search Result {i+1} (ranked by similarity) ---")
    print(doc.page_content[:200])
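Under the hood, similarity search ranks the stored vectors by their similarity to the query vector (typically cosine similarity) and returns the top k. A toy sketch with made-up 3-dimensional vectors — real text-embedding-3-small vectors have 1536 dimensions:

```python
# Sketch of what similarity_search does internally: rank stored vectors
# by cosine similarity to the query vector and return the top k.
# The 3-dimensional vectors below are made up for illustration.
import math

store = {
    "Refund policy: 14 days.":  [0.9, 0.1, 0.0],
    "Shipping info: 3-5 days.": [0.1, 0.9, 0.1],
    "Support hours: 9am-6pm.":  [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vec, k=2):
    ranked = sorted(store.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A query vector close to the refund document's vector
print(similarity_search([0.8, 0.2, 0.1], k=2))
```

A vector DB like Chroma does the same ranking, but with an index structure that avoids comparing the query against every stored vector.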

Full RAG Pipeline Assembly

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# LLM configuration
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Custom prompt
prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Answer the question based on the following documents.
If the information is not found in the documents, respond with "The requested information was not found in the provided documents."

Reference documents:
{context}

Question: {question}

Answer:""",
)

# Search + generation pipeline
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},  # Retrieve top 3 documents
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Combine retrieved documents into a single prompt
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template},
    return_source_documents=True,
)

# Execute query
response = qa_chain.invoke({"query": "Please explain the refund process"})
print("Answer:", response["result"])
print("\nReference documents:", len(response["source_documents"]))
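The chain_type="stuff" strategy is simple enough to sketch by hand: join the retrieved chunks into one context string and substitute it into the prompt template. A minimal illustration — the `stuff_prompt` helper and the documents are made up for this sketch, not LangChain internals:

```python
# Sketch of what chain_type="stuff" does: join the retrieved documents
# into a single context string and fill the prompt template with it.
TEMPLATE = """Answer the question based on the following documents.
If the information is not found in the documents, respond with "The requested information was not found in the provided documents."

Reference documents:
{context}

Question: {question}

Answer:"""

def stuff_prompt(docs, question):
    """Combine retrieved chunks into one prompt for the LLM."""
    context = "\n\n".join(docs)
    return TEMPLATE.format(context=context, question=question)

docs = [
    "Refunds are processed within 14 days of the request.",
    "Contact customer support to start a refund.",
]
print(stuff_prompt(docs, "Please explain the refund process"))
```

Because all retrieved chunks land in one prompt, "stuff" only works while k × chunk_size fits comfortably in the model's context window; for larger document sets LangChain offers other chain types such as "map_reduce".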

The two factors that most affect RAG quality are chunk size and retrieval quality. If chunks are too large, irrelevant noise gets mixed into the context; if too small, the context is broken apart and loses meaning. Experiment with chunk_size values between roughly 300 and 1,000 characters.
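The interaction between chunk_size and chunk_overlap is easy to see with a simplified character-count splitter. Unlike RecursiveCharacterTextSplitter, this sketch ignores separators and cuts purely by position:

```python
# Simplified character-based splitter illustrating how chunk_size and
# chunk_overlap interact. RecursiveCharacterTextSplitter additionally
# prefers to break on separators ("\n\n", "\n", ...); this sketch does not.

def split_text(text, chunk_size, chunk_overlap):
    chunks = []
    step = chunk_size - chunk_overlap  # each chunk starts this far after the last
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

text = "A" * 1200  # stand-in for a 1,200-character document
for size in (200, 500, 1000):
    parts = split_text(text, chunk_size=size, chunk_overlap=50)
    print(size, len(parts))  # 200 -> 8 chunks, 500 -> 3, 1000 -> 2
```

Smaller chunks mean more vectors to store and search, but each retrieved chunk is more focused — exactly the trade-off explored in exercise 3 below.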

Today’s Exercises

  1. Download 3 Wikipedia articles, feed them into a RAG pipeline, and build a Q&A system based on the document content.
  2. Use similarity_search_with_score() to include similarity scores in search results and add threshold-based filtering.
  3. Change chunk_size to 200, 500, and 1000, compare answer quality for the same question, and find the optimal setting.
